V. CONCLUSION AND FUTURE WORK
The evaluation shows that the CENFA learner combines the
strengths of ensemble and active learning. It increases
efficacy and efficiency compared to pure ensemble and pure
active learning, respectively. Compared to a standard
combination of ensemble and active learning, it almost
retains the effectiveness while increasing the efficiency
substantially. In terms of time, CENFA is up to 100,000
times faster.
The CENFA learner was created to classify Web documents,
in particular job offers. The approach still needs to be
evaluated in other domains of Web documents. It would also
be interesting to apply the method in other classification
scenarios, where entities other than Web documents have to
be classified. The concept is independent of the underlying
algorithm (SVM), so different algorithms can be tested in
such an environment. The base and specialized classifiers
could even use different algorithms. CENFA was evaluated
with pre-annotated training sets simulating human feedback.
Applying the algorithm in an actual active learning
environment is a required next step of the evaluation in
order to demonstrate its suitability in a real-world
scenario.
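The evaluation setup mentioned above, in which pre-annotated labels stand in for human feedback and the underlying classifier is interchangeable, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CENFA implementation: the toy `CentroidClassifier` (used here in place of an SVM), the margin-based query rule, and the function name `simulate_active_learning` are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, int]  # (feature vector, pre-annotated label 0 or 1)


class CentroidClassifier:
    """Toy binary classifier (assumption: a stand-in for the SVM)."""

    def __init__(self) -> None:
        self.centroids: Dict[int, List[float]] = {}

    def fit(self, examples: List[Example]) -> None:
        # Compute one mean vector per class label.
        self.centroids = {}
        for label in {y for _, y in examples}:
            rows = [x for x, y in examples if y == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def _distance(self, x: Vector, label: int) -> float:
        centroid = self.centroids[label]
        return sum((a - b) ** 2 for a, b in zip(x, centroid)) ** 0.5

    def predict(self, x: Vector) -> int:
        # Assign the label of the nearest centroid.
        return min(self.centroids, key=lambda label: self._distance(x, label))

    def margin(self, x: Vector) -> float:
        # A small margin means the two classes are almost equally close,
        # i.e. the classifier is uncertain about x.
        return abs(self._distance(x, 0) - self._distance(x, 1))


def simulate_active_learning(
    make_classifier: Callable[[], CentroidClassifier],
    seed: List[Example],
    pool: List[Example],
    rounds: int,
) -> Tuple[CentroidClassifier, List[Example]]:
    """Run an active learning loop with a simulated oracle.

    The pool is pre-annotated, but a label is only used once the
    corresponding example has been queried. The seed set must
    contain at least one example of each class.
    """
    labeled = list(seed)
    unlabeled = list(pool)
    clf = make_classifier()
    for _ in range(rounds):
        if not unlabeled:
            break
        clf.fit(labeled)
        # Query the example the current model is least certain about.
        query = min(range(len(unlabeled)), key=lambda i: clf.margin(unlabeled[i][0]))
        # Revealing the stored label simulates the human annotator.
        labeled.append(unlabeled.pop(query))
    clf.fit(labeled)
    return clf, labeled


seed_set = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
pool_set = [((0.5, 0.5), 0), ((3.5, 3.5), 1), ((1.9, 1.9), 0), ((2.2, 2.2), 1)]
model, used = simulate_active_learning(CentroidClassifier, seed_set, pool_set, rounds=2)
print(len(used), model.predict((0.2, 0.2)), model.predict((3.9, 3.9)))
```

Any model that offers the same `fit`, `predict`, and `margin` methods could be passed as `make_classifier` without changing the loop, which reflects the algorithm independence noted above. In a real active learning environment, the pool lookup would be replaced by a request to a human annotator.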
ACKNOWLEDGEMENTS
The work presented in this paper was partly funded by
the German Federal Ministry of Education and Research
(BMBF) under grant no. 01IS12054 and partially funded in
the framework of Hessen Modell Projekte, financed with funds
of LOEWE-State Offensive for the Development of Scientific
and Economic Excellence (HA project no. 292/11-37). The
responsibility for the contents of this publication lies with
the authors. We thank kimeta GmbH for their essential help
in building the evaluation corpus.
45 Polibits (49) 2014, ISSN 1870-9044
Combining Active and Ensemble Learning for Efficient Classification of Web Documents